Sixth Lab

Today’s goals. Learn about:

  1. Logistic Regression.
  2. Poisson Regression.

Workflow

  1. Consider examples related to the theoretical concepts
  2. Use R to solve some basic questions and to understand important aspects

Let’s start

  1. Download the R code (it is briefly commented)
  2. You will find XXX where I will ask you to code a bit
  3. Part of the code after XXX depends on successfully filling in these lines of code

Generalized Linear Models

GLMs

Many times, linearity between a dependent variable and a set of predictors is not a suitable assumption. Normal linear models rely on a normality assumption, and this may be a limitation when the dependent variable is discrete, e.g. with support \(\{0,1\}\) or \(\{0,1,2,\ldots\}\), or when it is continuous and its range spans from 0 to \(+\infty\). In normal linear models, the main assumption is (including the intercept, \(x_{i1} = 1\)): \[ E[Y_i|{\bf x}_i]= \eta_i = \sum_{j=1}^{p}\beta_jx_{ij}.\] Instead, the generalized linear model considers modelling a suitable transformation of the conditional expectation by using covariates, for a broad range of distributions, that is (including the intercept, \(x_{i1} = 1\))
\[g(E[Y_i|{\bf x}_i])= \eta_i =\sum_{j=1}^{p}\beta_jx_{ij}.\] The main ingredients of a GLM are: the distribution of \(Y_i\), which belongs to the exponential dispersion family; the linear predictor \(\eta_i\); and the link function \(g(\cdot)\), which must be twice differentiable.
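In R, these ingredients are specified through the family argument of glm(): the family fixes the distribution of \(Y_i\) and carries a (default) link function \(g(\cdot)\). A quick self-contained check of the default links:

```r
# Each family object stores its default link function g(.)
gaussian()$link   # "identity"
binomial()$link   # "logit"
poisson()$link    # "log"
```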

Logistic regression

Example

The following example is from Chapter 5 of Data analysis using regression and multilevel/hierarchical models, Gelman, A., & Hill, J. (2007), Cambridge University Press.

It is known that water in some wells in Bangladesh and other South Asian countries might be contaminated with natural arsenic. The effect of the arsenic ingestion on health is a cumulative poisoning, and the risk of cancer and other diseases is estimated to be proportional to the dose intake.

A research team from the United States and Bangladesh examined the water of several wells in the Araihazar district of Bangladesh, measured the arsenic level, and classified the wells as safe if the arsenic level was below 0.5 (in units of hundreds of micrograms per liter) or unsafe if it was above 0.5. All the people using unsafe wells were encouraged to switch to a nearby safe well. Indeed, the contamination does not depend on proximity: nearby wells can have very different levels of arsenic.

Logistic regression

Example

The dataset in the file Wells.csv contains information on whether or not 3020 households switched from unsafe wells within a few years of the arsenic investigation. The variables are

  • switch: whether or not the household switched to another well from an unsafe well: no (0) or yes (1);
  • arsenic: the level of arsenic contamination in the household’s original well (in hundreds of micrograms per liter); note that all values are above 0.5, the level identified as “unsafe”;
  • dist: distance to the closest known safe well (in meters);
  • educ: education of the head of the household (in years);
  • assoc: whether or not any members of the household participate in community organizations: no (0) or yes (1).

Goal: using logistic regression, we aim to determine the main factors associated with well switching among users of unsafe wells.

We start by reading the data, analysing their structure, and exploring some of their characteristics.

Logistic regression

data <- read.csv2("Wells.csv", header = TRUE)
str(data)
'data.frame':   3020 obs. of  5 variables:
 $ switch : int  1 1 0 1 1 1 1 1 1 1 ...
 $ arsenic: num  2.36 0.71 2.07 1.15 1.1 3.9 2.97 3.24 3.28 2.52 ...
 $ dist   : num  16.8 47.3 21 21.5 40.9 ...
 $ assoc  : int  0 0 0 0 1 1 1 0 1 1 ...
 $ educ   : int  0 0 10 12 14 9 4 10 0 0 ...
summary(data)
     switch          arsenic           dist             assoc       
 Min.   :0.0000   Min.   :0.510   Min.   :  0.387   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.820   1st Qu.: 21.117   1st Qu.:0.0000  
 Median :1.0000   Median :1.300   Median : 36.761   Median :0.0000  
 Mean   :0.5752   Mean   :1.657   Mean   : 48.332   Mean   :0.4228  
 3rd Qu.:1.0000   3rd Qu.:2.200   3rd Qu.: 64.041   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :9.650   Max.   :339.531   Max.   :1.0000  
      educ       
 Min.   : 0.000  
 1st Qu.: 0.000  
 Median : 5.000  
 Mean   : 4.828  
 3rd Qu.: 8.000  
 Max.   :17.000  

Logistic regression

par(mfrow = c(1, 2))
boxplot(dist ~ switch, data = data, horizontal = TRUE,
        xlab = "Distance (in meters)", ylab = "Switch")
boxplot(arsenic ~ switch, data = data, horizontal = TRUE,
        xlab = "Arsenic (in hundreds of mg per liter)", ylab = "Switch")

Logistic regression

par(mfrow = c(1, 1))
barplot(prop.table(table(data$switch, data$educ), margin = 1), beside = TRUE, 
        legend.text = c("no switch", "switch"), xlab = "Education (years)")

Logistic regression: Single predictor

Single predictor

We start by modelling the effect of the variable distance on the variable switch. Since the outcome variable is a binary variable, the model chosen is the logistic regression model. \[ \rm{Pr}(Y_i=1| dist_i)=E(Y_i | dist_i)=p_i \quad \quad \text{logit}(p_i)= \eta_i = \beta_1 +\beta_2 dist_i\] where \(\text{logit}(x)=\log(x/(1-x))\) is a function that maps the range \((0,1)\) to the range \((-\infty,\infty)\).
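As a quick self-contained check that the logit map and its inverse behave as stated (the helper names logit and InvLogit are our own; InvLogit is also defined later in this lab):

```r
logit    <- function(p) log(p / (1 - p))        # maps (0,1) to (-Inf, Inf)
InvLogit <- function(eta) 1 / (1 + exp(-eta))   # maps (-Inf, Inf) back to (0,1)
logit(0.5)             # 0: probability one half maps to the origin
InvLogit(logit(0.9))   # recovers 0.9 (up to floating point rounding)
```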


fit1 <- glm(switch ~ dist, data = data, family = binomial(link = "logit"))
summary(fit1)

Call:
glm(formula = switch ~ dist, family = binomial(link = "logit"), 
    data = data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.6059594  0.0603102  10.047  < 2e-16 ***
dist        -0.0062188  0.0009743  -6.383 1.74e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4118.1  on 3019  degrees of freedom
Residual deviance: 4076.2  on 3018  degrees of freedom
AIC: 4080.2

Number of Fisher Scoring iterations: 4

Logistic regression: Single predictor

Interpretation

The intercept (\(\beta_1\)) can only be interpreted assuming zero values for the other predictors. Here, when dist\(=0\), the estimated probability of switching for a household living next to a safe well is about \(65\%\), given by \[\hat P(Y_i = 1 | dist_i = 0) = \frac{e^{\hat \beta_1}}{1+e^{\hat \beta_1}}= \frac{1}{1+e^{-\hat \beta_1}}\]

InvLogit <- function(eta) 1/(1 + exp(-eta))
as.numeric(InvLogit(c(1,0) %*% coef(fit1)))
[1] 0.6470185
predict(fit1, newdata = data.frame(dist = 0), type = "response")
        1 
0.6470185 

Logistic regression: Single predictor

Interpretation

Interpreting the coefficient for the predictor dist on the log-odds scale: the coefficient for dist can be interpreted as a difference of \(-0.0062\) in the logit of the estimated probability of switching well when the distance increases by one meter, that is

\[{\rm logit}(P(Y_i=1|dist_i=c+1)) =\beta_1 + (c + 1) \beta_2\] \[{\rm logit}(P(Y_i=1|dist_i=c)) =\beta_1 + c \beta_2 \] \[{\rm logit}(P(Y_i=1|dist_i=c+1)) - {\rm logit}(P(Y_i=1|dist_i=c)) = \beta_2\]
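As a quick numeric check of the identity above (a sketch that assumes the fitted object fit1 from the previous slides; the baseline value c = 50 is an arbitrary choice):

```r
# The logit difference for a one-meter increase equals the dist coefficient,
# whatever baseline distance c we pick
c0 <- 50
eta <- function(d) as.numeric(coef(fit1) %*% c(1, d))
eta(c0 + 1) - eta(c0)   # same value as coef(fit1)["dist"]
```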

From the previous expression, we can obtain the log-odds ratio \[\log\Bigg(\frac{P(Y_i=1|dist_i=c+1)/P(Y_i=0|dist_i=c+1)}{P(Y_i=1|dist_i=c)/P(Y_i=0|dist_i=c)}\Bigg)=\beta_2\] and so \[\frac{P(Y_i=1|dist_i=c+1)}{P(Y_i=0|dist_i=c+1)} = \exp(\beta_2) \frac{P(Y_i=1|dist_i=c) }{P(Y_i=0|dist_i=c)}.\] The ratio between the probability of success (in this case, switching well) and the probability of failure (that is, the odds), when \(dist=c+1\), is estimated to be \(\exp(\hat \beta_2) = 0.99\) times the odds when \(dist=c\).

exp(coef(fit1)[2])
     dist 
0.9938005 

Logistic regression: Single predictor

Interpretation

Interpreting the coefficient for the predictor dist on the probability scale: we can evaluate the difference of the logistic regression function when adding 1 to the predictor dist. This corresponds to a difference on the probability scale, but requires the choice of a specific value of the predictor, that is \[P(Y_i = 1 | dist_i = c+1) - P(Y_i = 1 | dist_i = c) = \frac{1}{1+\exp(-(\beta_1 + (c+1) \times\beta_2))} - \frac{1}{1+\exp(-(\beta_1 + c \times\beta_2))}\] The mean of the predictor is a good starting point since it corresponds to the steepest point of the inverse logit function. Thus, adding 1 to dist (that is, adding 1 meter to the distance to the nearest safe well) corresponds to a negative difference of about \(0.15\%\) in the estimated probability of switching.

as.numeric(InvLogit(c(1, mean(data$dist) + 1) %*% coef(fit1))-
           InvLogit(c(1, mean(data$dist)) %*% coef(fit1)))
[1] -0.001519722

Interpretation

Of course, you can evaluate the (estimated) probability difference between two different distances.

as.numeric(InvLogit(c(1, 300) %*% coef(fit1)) - InvLogit(c(1, 0) %*% coef(fit1)))
[1] -0.4259907

Logistic regression: Single predictor

Interpretation

Note that the coefficients here do not have a linear effect on the probability that the outcome is 1, because of the nonlinearity of the model. The curve of the logistic function requires us to choose where to evaluate changes if we want to interpret on the probability scale. Below is the difference in the (estimated) probability of switching well due to a one-unit increase of the predictor from its 99th percentile

as.numeric(InvLogit(c(1, quantile(data$dist, 0.99) + 1) %*% coef(fit1)) -
           InvLogit(c(1, quantile(data$dist, 0.99)) %*% coef(fit1)))
[1] -0.001465513

Interpretation

The effect of a one-unit increase of the variable dist on the inverse logit seems low, but this is misleading. Indeed, the distance is measured in meters, so this coefficient corresponds to the difference between, say, a house that is 48 meters away from the nearest safe well and a house that is 49 meters away. To improve the interpretability of the results we rescale distance in 100-meter units. Adding 100 meters to the distance to the nearest safe well then corresponds to a negative difference in the probability of switching of about \(15\%\).

Logistic regression: Single predictor

Interpretation

  • Note that \(\hat \beta_2\) of the new model is 100 times the \(\hat \beta_2\) of the original one

  • Also note the following equivalence: the difference on the probability (of switching well) scale for a unit increase in the new model coincides with the difference for a 100-meter increase in the original model


data$dist100 <- data$dist / 100
fit2 <- glm(switch ~ dist100, data = data, family = binomial(link = "logit"))
summary(fit2)

Call:
glm(formula = switch ~ dist100, family = binomial(link = "logit"), 
    data = data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.60596    0.06031  10.047  < 2e-16 ***
dist100     -0.62188    0.09743  -6.383 1.74e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4118.1  on 3019  degrees of freedom
Residual deviance: 4076.2  on 3018  degrees of freedom
AIC: 4080.2

Number of Fisher Scoring iterations: 4
as.numeric(InvLogit(c(1, mean(data$dist100) + 1) %*% coef(fit2)) -
           InvLogit(c(1, mean(data$dist100)) %*% coef(fit2)))
[1] -0.1542287
as.numeric(InvLogit(c(1, mean(data$dist) + 100) %*% coef(fit1)) -
           InvLogit(c(1, mean(data$dist)) %*% coef(fit1)))
[1] -0.1542287

Logistic regression: Single predictor

Interpretation

The probability of switching well is estimated to be about \(60\%\) for those who live close to a safe well, declining to about \(20\%\) when the distance from a safe well increases up to \(300\) meters from their home. This seems reasonable: the probability of switching is higher for people who live closer to a safe well (Gelman and Hill, 2007).

switch.jitter <- jitter(data$switch, factor = 0.10)

par(mfrow = c(1, 1))
with(data, plot(dist, switch.jitter,
                xlab = "Distance (in meters) to nearest safe well",
                ylab = "Pr(Swiching)", type = "n"))
curve(InvLogit(coef(fit1)[1] + coef(fit1)[2] * x), lwd = 1, add = TRUE)
points(data$dist, switch.jitter, pch = 20, cex = .1)

Logistic regression: Adding predictors

Additional predictors

We consider adding the arsenic level in our model. We expect that the higher the arsenic concentration the higher the probability of switching well, that is, translated in model terms, a positive sign for the arsenic coefficient. We plot the histogram of arsenic concentration before fitting the model and analysing the results.

par(mfrow = c(1,1))
hist(data$arsenic, freq = TRUE, xlab = "Arsenic concentration",
     breaks = seq(0, 0.25 + max(data$arsenic), 0.25), main = "")

Logistic regression: Adding predictors

Additional predictors

We now fit the model \[\text{logit}(P(Y_i=1|dist^{*}_i, arsenic_i))= \eta_i = \beta_1 +\beta_2 \texttt{dist}^{*}_i + \beta_3 \texttt{arsenic}_i\]

(note \(dist^* = dist/100\))

fit3 <- glm(switch ~ dist100 + arsenic, data = data, family = binomial("logit"))
summary(fit3)

Call:
glm(formula = switch ~ dist100 + arsenic, family = binomial("logit"), 
    data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.002749   0.079448   0.035    0.972    
dist100     -0.896644   0.104347  -8.593   <2e-16 ***
arsenic      0.460775   0.041385  11.134   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4118.1  on 3019  degrees of freedom
Residual deviance: 3930.7  on 3017  degrees of freedom
AIC: 3936.7

Number of Fisher Scoring iterations: 4

Logistic regression: Adding predictors

Interpretation

  • Given two wells with the same arsenic level (notice that it cannot be 0!), the estimated logit of the probability of switching decreases by \(0.9\) for every 100 meters of distance to the nearest safe well. \[\text{logit}(P(Y_i=1|dist^*_i=c+1, arsenic_i = k)) - \text{logit}(P(Y_i=1|dist^*_i=c, arsenic_i = k)) = \beta_2\]
  • For two equally distant nearest safe wells, a difference of 1 in arsenic concentration corresponds to a positive difference of 0.46 in the logit of the probability of switching. \[\text{logit}(P(Y_i=1|dist^*_i=c, arsenic_i = k+1)) - \text{logit}(P(Y_i=1|dist^*_i=c, arsenic_i = k)) = \beta_3\]
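A quick numeric check of the two identities (a sketch assuming the fitted object fit3 from the code above; the baseline values for distance and arsenic are arbitrary):

```r
# Linear predictor of fit3 at distance d (in 100-meter units) and arsenic a
eta3 <- function(d, a) as.numeric(coef(fit3) %*% c(1, d, a))
eta3(2, 1) - eta3(1, 1)   # equals coef(fit3)["dist100"]
eta3(1, 2) - eta3(1, 1)   # equals coef(fit3)["arsenic"]
```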

Logistic regression: Adding predictors

Note

The following plots show the fitted model with two predictors. Numerical interpretation of one predictor effect on probability scale needs to be done choosing a value for the other predictor.

  • Left: probability of switching well vs distance when arsenic is equal to 0.5 and 1.5

  • Right: probability of switching well vs arsenic when the distance is equal to 0 and 50

par(mfrow = c(1, 2))
with(data, plot(dist100, switch.jitter, xlab = "Distance", ylab = "Pr(Switching)", type = "n"))
curve(InvLogit(coef(fit3)[1] + coef(fit3)[2] * x + coef(fit3)[3] * 1.5), add = TRUE, col = "blue")
curve(InvLogit(coef(fit3)[1] + coef(fit3)[2] * x + coef(fit3)[3] * 0.5), add = TRUE, col = "red")
points(data$dist100, switch.jitter, pch = 20, cex = .1)
text (0.50, .27, "if As = 0.5", adj = 0, cex = .8, col = "red")
text (0.75, .50, "if As = 1.5", adj = 0, cex = .8, col = "blue")

with(data, plot(arsenic, switch.jitter, xlab = "Arsenic", ylab = "Pr(Switching)", type = "n"))
curve(InvLogit(coef(fit3)[1] + coef(fit3)[3] * x), add = TRUE, col = "red")
text(1.5, 0.8, "if dist=0", col = "red")
curve(InvLogit(coef(fit3)[1] + coef(fit3)[2] * 0.5 + coef(fit3)[3] * x), add = TRUE, col = "blue")
text(4, 0.6, "if dist=50", col = "blue")
points(data$arsenic, switch.jitter, cex = 0.01, pch = 20)

Logistic regression: Adding predictors

Interpretation


Note

In general terms, the fitted model can be summarized as: switching is easier if there is a nearby safe well, and if a household’s existing well has a high arsenic level, there should be more motivation to switch.

Logistic regression: Including interaction term

Interactions

We now want to verify if there is a statistically significant effect of the interaction between the distance and the arsenic concentration on the probability of switching well.

fit4 <- glm(switch ~ dist100 * arsenic,  
            data = data, family = binomial(link = "logit"))
pred4 <- predict(fit4, type = "response")
summary(fit4)

Call:
glm(formula = switch ~ dist100 * arsenic, family = binomial(link = "logit"), 
    data = data)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -0.14787    0.11754  -1.258  0.20838    
dist100         -0.57722    0.20918  -2.759  0.00579 ** 
arsenic          0.55598    0.06932   8.021 1.05e-15 ***
dist100:arsenic -0.17891    0.10233  -1.748  0.08040 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4118.1  on 3019  degrees of freedom
Residual deviance: 3927.6  on 3016  degrees of freedom
AIC: 3935.6

Number of Fisher Scoring iterations: 4

Logistic regression: Including interaction term

Interactions

We fix the predictors at their means, and so the estimated probability of switching well at the average distance and average arsenic concentration is about 0.59. \[\hat P(Y_i = 1|dist^*_i = \overline{dist^*}, arsenic_i = \overline{arsenic}) = \frac{\exp(\hat \beta_1 + \hat \beta_2 \overline{dist^*} + \hat \beta_3 \overline{arsenic} + \hat \beta_4 \overline{dist^*} \times \overline{arsenic})}{1+\exp(\hat \beta_1 + \hat \beta_2 \overline{dist^*} + \hat \beta_3 \overline{arsenic} + \hat \beta_4 \overline{dist^*} \times \overline{arsenic})}\]

as.numeric(InvLogit(c(1,mean(data$dist100), mean(data$arsenic),
           mean(data$dist100)*mean(data$arsenic))%*%coef(fit4)))
[1] 0.5868829
predict(fit4, newdata = data.frame(dist100 = mean(data$dist100), 
                                   arsenic = mean(data$arsenic)), 
        type = "response")
        1 
0.5868829 

Logistic regression: Including interaction term

Interactions

At the mean level of arsenic in the data, each 100 meters of distance corresponds to an approximate 21% negative difference in the probability of switching. \[\hat P(Y_i = 1|dist^*_i = c + 1, arsenic_i = \overline{arsenic}) - \hat P(Y_i = 1|dist^*_i = c , arsenic_i = \overline{arsenic})= \] \[=\frac{\exp(\hat \beta_1 + \hat \beta_2 (c + 1) + \hat \beta_3 \overline{arsenic} + \hat \beta_4 (c+1) \times \overline{arsenic})}{1+\exp(\hat \beta_1 + \hat \beta_2 (c+1) + \hat \beta_3 \overline{arsenic} + \hat \beta_4 (c+1) \times \overline{arsenic})}- \frac{\exp(\hat \beta_1 + \hat \beta_2 c + \hat \beta_3 \overline{arsenic} + \hat \beta_4 c \times \overline{arsenic})}{1+\exp(\hat \beta_1 + \hat \beta_2 c + \hat \beta_3 \overline{arsenic} + \hat \beta_4 c \times \overline{arsenic})}\]

mean_d100 <- mean(data$dist100); mean_ars <- mean(data$arsenic)
as.numeric(InvLogit(c(1, mean_d100 + 1, mean_ars, (mean_d100+1) * mean_ars) %*%coef(fit4)) -
             InvLogit(c(1, mean_d100, mean_ars, mean_d100 * mean_ars) %*% coef(fit4)))
[1] -0.2146287

Logistic regression: Including interaction term

Interactions

At the mean level of distance in the data, each additional unit of arsenic corresponds to an approximate 11% positive difference in probability of switching (the mathematical expression is similar to the expression above)

as.numeric(InvLogit(c(1, mean_d100, mean_ars + 1, mean_d100 * (mean_ars+1)) %*%coef(fit4)) -
             InvLogit(c(1, mean_d100, mean_ars, mean_d100 * mean_ars) %*% coef(fit4)))
[1] 0.1074813

Logistic regression: Including interaction term

Interactions

The interaction term can be read, on the logit scale, as the amount (\(-0.18\)) added to the coefficient for distance for each additional unit of arsenic; indeed

\[{\rm logit}(P(Y_i|dist^*_i, arsenic_i = k)) = \beta_1 + \beta_2 dist^*_i + \beta_3 k + \beta_4 dist^*_i \times k\]

\[{\rm logit}(P(Y_i|dist^*_i, arsenic_i = k + 1)) = \beta_1 + \beta_2 dist^*_i + \beta_3 (k + 1) + \beta_4 dist^*_i (k + 1) = \] \[ = \beta_1 + (\beta_2 + \beta_4) dist^*_i + \beta_3 (k + 1) + \beta_4 dist^*_i \times k \] So, the importance of distance as a predictor increases for households with higher existing arsenic levels (\(\hat \beta_2 + \hat \beta_4 = -0.58 - 0.18\))

Further, \[ {\rm logit}(P(Y_i|dist^*_i, arsenic_i = k + 1)) - {\rm logit}(P(Y_i|dist^*_i, arsenic_i = k)) = \beta_3 + \beta_4 dist^{*}_i\] So, at the average distance the difference in estimated logit probabilities is about 0.47

coef(fit4)[3] + coef(fit4)[4]*mean(data$dist100)
  arsenic 
0.4695081 

Logistic regression: Including interaction term

Interactions

The interaction term can be read, on the logit scale, as the amount (\(-0.18\)) added to the coefficient for arsenic for each additional 100 meters of distance; indeed

\[{\rm logit}(P(Y_i|dist^*_i=c, arsenic_i)) = \beta_1 + \beta_2 c + \beta_3 arsenic_i + \beta_4 c \times arsenic_i\]

\[{\rm logit}(P(Y_i|dist^* = c+1, arsenic)) = \beta_1 + \beta_2 (c+1) + \beta_3 arsenic_i + \beta_4 (c+1) arsenic_i = \] \[ = \beta_1 + \beta_2(c+1) + (\beta_3 + \beta_4) arsenic_i + \beta_4 c \times arsenic_i \] So, the importance of arsenic as a predictor decreases for households that are farther from the existing safe wells (\(\hat \beta_3 + \hat \beta_4 = 0.56 - 0.18\))

Further, \[ {\rm logit}(P(Y_i|dist^*_i=c+1, arsenic_i = k)) - {\rm logit}(P(Y_i|dist^*_i=c, arsenic_i = k)) = \beta_2 + \beta_4 k\] So, at the average value of arsenic the difference in estimated logit probabilities is about -0.87.

coef(fit4)[2] + coef(fit4)[4]*mean(data$arsenic)
   dist100 
-0.8736527 

Logistic regression: Including interaction term

Interactions

The following plots show that the differences in switching associated with differences in arsenic level are larger if you are close to a safe well, but with a diminishing effect if you are far from any safe well. Comparing with the same plot without interaction, note that the two curves mainly differ for higher values of distance.

par(mfrow = c(1,2))
# Pr(Switching) vs Distance (Consider Arsenic = 0.5 and Arsenic = 1.5)
with(data, plot(data$dist, switch.jitter, type = "n",
                xlab = "Distance", ylab = "Pr(Switching)"))
curve(InvLogit(cbind(1, x/100, 1.5, 1.5 * x/100) %*% coef(fit4)), add = TRUE, col = "red")
curve(InvLogit(cbind(1, x/100, 0.5, 0.5 * x/100) %*% coef(fit4)), add = TRUE, col = "blue")
text(100, 0.6,"if As = 1.5", col = "red"); text(35, 0.4,"if As = 0.5", col = "blue")
points(data$dist, switch.jitter, pch = 20, cex = 0.01)

# Pr(Switching) vs Arsenic (Consider Distance = 0 and Distance = 50)
with(data, plot(data$arsenic, switch.jitter, type = "n",
                xlab = "Arsenic concentration", ylab = "Pr(Switching)"))
curve(InvLogit(cbind(1, 0, x, 0) %*% coef(fit4)), add = TRUE, col = "red")
curve(InvLogit(cbind(1, 0.5, x, 0.5 * x) %*% coef(fit4)), add = TRUE, col = "blue")
text(2, 0.9, "if Dist = 0", col = "red"); text(3, 0.6, "if Dist = 50", col = "blue")
points(data$arsenic, switch.jitter, pch = 20, cex = 0.01)


Logistic regression: Model Comparison and Model Selection

Model Comparison

Nested models can be compared through a likelihood-ratio test (LRT). More precisely, let \(\boldsymbol \beta = (\boldsymbol \beta^\top_0, \boldsymbol \beta^\top_1)^\top\), where \(\boldsymbol \beta_0=(\beta_1, \ldots, \beta_{p_0})\) and \(\boldsymbol \beta_1=(\beta_{p_0 + 1}, \beta_{p_0 + 2}, \ldots, \beta_{p})\). We want to test
\[ \begin{cases} H_{0}: & \beta_{p_0 + 1} = \beta_{p_0 + 2} = \ldots = \beta_{p} = 0\\ H_{1}: & \exists r \in \{p_0+1, \ldots, p\}: \beta_{r} \neq 0 \end{cases} \] The LRT statistic, based on the deviance difference, is asymptotically distributed as \(\chi^2_{p-p_0}\) under \(H_0\), and allows a comparison among nested models.

anova(fit4, test="LRT") 
Analysis of Deviance Table

Model: binomial, link: logit

Response: switch

Terms added sequentially (first to last)

                Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                             3019     4118.1              
dist100          1   41.861      3018     4076.2 9.798e-11 ***
arsenic          1  145.570      3017     3930.7 < 2.2e-16 ***
dist100:arsenic  1    3.040      3016     3927.6   0.08124 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note

The residual deviance drops significantly when adding the distance and arsenic variables, while only a marginal benefit is obtained by adding the interaction effect.

Logistic regression: Model Comparison and Model Selection

Model Selection

Model selection can also be carried out using the AIC, which judges a model by how close its fitted values tend to be to the true expected values, as summarized by a certain expected distance between the two. The AIC penalizes a model for having many parameters, and smaller AIC values are better.

c( AIC(fit1), AIC(fit3), AIC(fit4))
[1] 4080.238 3936.668 3935.628

Note

The conclusions are similar to those reported above.

Exercise Explore the Anova() function of the car package.

Logistic regression: Binned Residuals

Binned Residuals

The outcome variable in logistic regression can take only the values 0 and 1. The plot of the residuals, defined as \[{\rm residual}_i = y_i - \hat {\pi}_i = y_i-\text{logit}^{-1}({\mathbf x}^\top_i\hat \beta)\] versus the fitted values \(\hat \pi_i= \text{logit}^{-1}({\mathbf x}^\top_i\hat \beta)\), that is the plot of the points \((\hat \pi_i, y_i - \hat \pi_i)\), is not useful, because the points lie on two parallel lines with slope \(-1\). Let’s consider the model labelled as fit4.

pred4 <- predict(fit4, type = "response")
res4 <- residuals(fit4, type = "response")

plot(c(0, 1), c(-1, 1), xlab = "Estimated Pr(Switching)", type = "n",
     ylab = "Observed - Estimated", main = "Residual Plot")
abline(h = 0, col = "gray", lwd = 0.5)
points(pred4, res4, pch = 20, cex = 0.2)
abline(c(0, -1), col = "red", lwd = 0.25)
abline(c(1, -1), col = "red", lwd = 0.25)


Logistic regression: Binned Residuals

Binned Residuals

A more readable plot can be made using binned residuals: divide the data into categories (bins) based on their fitted values, and then plot the average residual versus the average fitted value for each bin (Gelman and Hill, 2007, page 97).

The number of bins has to be chosen so that each bin contains enough points: the resulting plot should not be too noisy (too many bins), but should still be able to highlight patterns in the residuals (hidden if the number of bins is too low). Below is the function for obtaining the binned residuals (with small changes w.r.t. the binned.resids() function of the arm package). The inputs are

  • x: represents the fitted values
  • y: represents the observed - fitted values
  • nclass: number of categories (bins)
binned.resids <- function(x, y, nclass = sqrt(length(x))){
 breaks.index <- floor(length(x) * (1 : (nclass))/nclass)
 breaks <- c(-Inf, sort(x)[breaks.index], Inf)
 output <-  xbreaks <- NULL 
 x.binned <- as.numeric(cut(x, breaks))
 for(i in 1:nclass){
  items <- (1:length(x))[x.binned == i]
  x.range <- range(x[items])
  xbar <- mean(x[items])
  ybar <- mean(y[items])
  n <- length(items)
  sdev <- sd(y[items])
  output <- rbind(output, c(xbar, ybar, n, x.range, 2 * sdev/sqrt(n)))
 }
 colnames(output) <- c("xbar", "ybar", "n", "x.lo", "x.hi", "2se")
 return(list(binned = output, xbreaks = xbreaks))
}

Logistic regression: Binned Residuals

Binned Residuals

The following plot shows the binned residuals vs the estimated probability of switching. Grey lines depict \(\pm 2\) standard error (se) bounds, so we expect that 95% of the binned residuals fall between these two lines. The se is defined as \(\sqrt{p(1-p)/n}\), where \(n\) is the number of data points used to compute each average residual.

br4 <- binned.resids(pred4, res4, nclass = 50)$binned
plot(range(br4[,1]), range(br4[,2],br4[,6], -br4[,6]),  type = "n",
     xlab = "Estimated Pr(Switching)", ylab = "Average residual", main = "Binned Residual Plot")
abline(h = 0, col = "gray", lwd = 0.5); points(br4[,1], br4[,2], pch = 19, cex = 0.5)
lines(br4[,1], br4[,6], col = "gray", lwd = 0.5); lines(br4[,1], -br4[,6], col = "gray", lwd = 0.5)


Logistic regression: Binned Residuals

Binned Residuals

A pattern can be identified by looking at the plot of the binned residuals with respect to the arsenic level. Indeed, there are large negative binned residuals for some households whose wells have small arsenic concentrations. These points correspond to people who are less likely to switch well than predicted by the model (almost \(-20\%\)). Moreover, the distribution of the residuals shows a slight pattern: the model seems to underestimate the probability of switching for medium arsenic levels, while overestimating it for high arsenic concentrations. So we plot the binned residuals vs the predictors.

par(mfrow = c(1, 2))
br.dist <- binned.resids(data$dist, res4,
                         nclass = 40)$binned
plot(range(br.dist[,1]), range(br.dist[,2], br.dist[,6], -br.dist[,6]), type = "n",
     xlab = "Distance", ylab = "Average residual", main = "Binned residual plot")
abline(h = 0, col = "gray", lwd = 0.5)
lines(br.dist[,1], br.dist[,6], col = "gray", lwd = 0.5)
lines(br.dist[,1], -br.dist[,6], col = "gray", lwd = 0.5)
points(br.dist[,1], br.dist[,2], pch = 19, cex = 0.5)

br.arsenic <- binned.resids(data$arsenic, res4,
                         nclass = 40)$binned
plot(range(br.arsenic[,1]), range(br.arsenic[,2], br.arsenic[,6], -br.arsenic[,6]), type = "n",
     xlab = "Arsenic concentration", ylab = "Average residual", main = "Binned residual plot")
abline(h = 0, col = "gray", lwd = 0.5)
lines(br.arsenic[,1], br.arsenic[,6], col = "gray", lwd = 0.5)
lines(br.arsenic[,1], -br.arsenic[,6], col = "gray", lwd = 0.5)
points(br.arsenic[,1], br.arsenic[,2], pch = 19, cex = 0.5)

Exercise

Propose a possible solution to improve the model and fit your model. Compare the new binned residual plots with the previous ones.


Logistic regression: Error rate

Error rate

Error rate is defined as the proportion of cases for which the fitted probability is above a threshold (commonly 0.5) while the outcome is 0, or below the threshold while the outcome is 1, that is, when the prediction-guessing is wrong: \[er = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\big((y_i=0 \,\wedge\, \hat p_i > 0.5) \,\vee\, (y_i=1 \,\wedge\, \hat p_i < 0.5)\big)\] It should be a value between 0 (perfect prediction-guessing) and 0.5 (random prediction-guessing).

Usually we expect the error rate to be lower than that of the null model, i.e. the model including only the intercept. The estimated probability under the null model is \(\hat p=\sum_i y_i/n\), the proportion of 1’s in the data. The error rate of the null model is \(\min(\hat p, 1-\hat p)\), and it corresponds to assigning 1 (or 0) to all the fitted values.

[1] 0.4248344
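The null error rate reported above takes a couple of lines; this is a minimal sketch with a toy 0/1 vector standing in for the actual outcome (data$switch in this lab):

```r
## Minimal sketch of the null-model error rate; the toy vector y stands
## in for the observed 0/1 outcome (data$switch in this lab)
y <- c(1, 1, 1, 0, 0)        # toy outcome: 60% ones
p.null <- mean(y)            # estimated proportion of 1's
min(p.null, 1 - p.null)      # null error rate: 0.4
```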

Logistic regression: Error rate

Error rate

So if our model improves the prediction of a simple logistic regression, we expect its error rate to be lower than that of the null model. Our last fitted model (labelled fit4), with an error rate of \(0.38\), correctly predicts \(62\%\) of the switching choices, giving only about a \(4\%\) improvement over simply guessing that all the households will switch.

[1] 0.3758278

Error rate

The error rate can be considered the counterpart of the \(R^2\) for the linear model: it is a useful measure of model fit, but it cannot substitute for a deeper evaluation of the model based on the significance of the coefficients and on the diagnostic plots. Analysing the error rate amounts to extracting information from the confusion matrix. Note that the 0.5 threshold can be varied, and a better assessment can be obtained by analysing the ROC curve.

       
           0    1
  FALSE  442  294
  TRUE   841 1443
[1] 0.6241722
[1] 0.3758278
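The confusion matrix and the error rate shown above can be obtained along these lines; the sketch uses toy vectors, since in the lab y and p.hat would be the observed outcome (data$switch) and the fitted probabilities of fit4:

```r
## Sketch: confusion matrix and error rate from toy vectors;
## in the lab, y = data$switch and p.hat = fitted(fit4)
y     <- c(1, 1, 0, 0, 1, 0)              # toy observed outcomes
p.hat <- c(0.8, 0.4, 0.3, 0.6, 0.7, 0.2)  # toy fitted probabilities

table(p.hat > 0.5, y)                     # rows: predicted class
mean((y == 0 & p.hat > 0.5) | (y == 1 & p.hat < 0.5))  # error rate: 2/6
```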

Exercise

Compute the error rate of your proposed model and compare it with the previous one.

Poisson Regression

Poisson Regression

Poisson Regression

Suppose the response variable represents a count. A sensible choice is to model \(y_i\), \(i=1, \ldots, n\), with a Poisson distribution, which belongs to the exponential dispersion family, so a GLM can be fitted. Since the counts have a positive mean, it is sensible to consider a log link function \[g(\mu_i) = \log(\mu_i)= \sum_{j=1}^{p} x_{ij} \beta_j, \quad i=1, \ldots, n\]

Then, the mean satisfies \[\mu_i = E[Y_i|\mathbf x_{i}] = \exp \bigg(\sum_{j=1}^{p} x_{ij} \beta_j\bigg) \]

The mean of \(Y\) at \(x_{ik}=c+1\) is equal to the mean at \(x_{ik}=c\) times \(\exp(\beta_k)\), adjusting for the other explanatory variables, that is

\[E[Y_i|x_{ik}=c+1, \ldots] = \exp\bigg(\sum_{j\neq k}^{} x_{ij} \beta_j + (c+1)\beta_k\bigg) \]

\[E[Y_i|x_{ik}=c, \ldots] = \exp\bigg(\sum_{j\neq k}^{} x_{ij} \beta_j + c\beta_k\bigg) \]

\[\frac{E[Y_i|x_{ik}=c+1, \ldots]}{E[Y_i|x_{ik}=c, \ldots]} = \exp(\beta_k)\] So, for a one-unit increase in \(x_{ik}\), the expected value of \(Y\) is multiplied by \(e^{\beta_k}\), adjusting for the other covariates.
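This multiplicative effect can be checked numerically; a small sketch with hypothetical coefficients (intercept \(-1\), slope \(0.3\)):

```r
## Numeric check of the rate-ratio interpretation: under a log link,
## a one-unit increase in x_k multiplies the mean by exp(beta_k).
## The coefficients below are hypothetical.
beta0 <- -1; betak <- 0.3
mu <- function(x) exp(beta0 + betak * x)   # fitted mean at x
mu(3) / mu(2)                              # equals exp(0.3)
exp(betak)
```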

Poisson Regression: Example

Example: Modelling claims frequency

The dataset SingaporeAuto.txt contains information on the number of accidents for a subset of data collected by the General Insurance Association of Singapore. Also some characteristics of the driver and the car are available. Only the data that refer to automobiles will be considered here.

The purpose of the analysis is to understand the impact of vehicle and driver characteristics on accident experience. These relationships represent the basis for actuaries working in rate making, i.e., setting the price of insurance coverages.

The variables and their description are the following

  • Female: Gender of Insured (female=1, male = 0)
  • PC: private vehicle (yes=1, no = 0)
  • Clm_Count: Number of claims during the year
  • Exp_weights: the fraction of the year the policy is in effect (an exposure variable)
  • NCD: No Claim Discount (the higher the discount, the better the prior accident record)
  • AgeCat: The age of the policy holder grouped into 7 categories, 0–6 (0 = less than 21; the others refer respectively to 22–25, 26–35, 36–45, 46–55, 56–65, over 66)
  • VAgeCat: The age of the vehicle in 7 categories (0–6, indicating the groups 0, 1, 2, 3–5, 6–10, 11–15, 16 and older)

Poisson Regression: Example

Example: Modelling claims frequency

Thus, we read the data and take a look at the first 5 records

accidents <- read.table("SingaporeAuto.txt",header=T)
accidents[1:5,]
   Female PC Clm_Count Exp_weights NCD AgeCat VAgeCat
34      0  1         0  0.41889117  40      3       6
35      0  1         0  0.08487337  20      4       0
36      0  1         0  0.83778234  50      5       0
37      0  1         0  0.54757016  10      4       6
38      0  1         1  0.89527721  20      3       6

Example: Modelling claims frequency

Note that the last 3 variables are actually categorical, so it is convenient to redefine them as factors. Note also that for the automobile data not all the categories are observed.

accidents$NCD <- as.factor(accidents$NCD)
accidents$AgeCat <- as.factor(accidents$AgeCat)
accidents$VAgeCat <- as.factor(accidents$VAgeCat)

Poisson Regression: Example

Example: Modelling claims frequency

The dependent variable is the count variable Clm_Count. Let’s first take a look at its distribution.

table(accidents$Clm_Count)

   0    1    2    3 
3555  271   15    1 
barplot(table(accidents$Clm_Count))

Poisson Regression: Example

Example: Modelling claims frequency

By comparing the relative frequencies with the probabilities from a Poisson with \(\lambda\) equal to the sample mean (we do not conduct a formal test), it seems sensible to fit a Poisson regression model to these data

table(accidents$Clm_Count)/nrow(accidents)

           0            1            2            3 
0.9252993233 0.0705361791 0.0039042166 0.0002602811 
round(dpois(0:3, mean(accidents$Clm_Count)),6)
[1] 0.923924 0.073106 0.002892 0.000076

Poisson Regression: Example

Example: Modelling claims frequency

We can now estimate a Poisson regression model including all the available covariates (except Exp_weights, for reasons that will be clarified in a couple of pages)

model1 <- glm(Clm_Count ~ Female + AgeCat + VAgeCat + NCD,
              family = poisson, data = accidents)
summary(model1)

Call:
glm(formula = Clm_Count ~ Female + AgeCat + VAgeCat + NCD, family = poisson, 
    data = accidents)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.46995    0.31001  -7.967 1.62e-15 ***
Female      -0.15477    0.15528  -0.997  0.31890    
AgeCat3      0.26074    0.31539   0.827  0.40839    
AgeCat4      0.19969    0.32133   0.621  0.53430    
AgeCat5      0.03219    0.35342   0.091  0.92743    
AgeCat6      0.61281    0.39180   1.564  0.11780    
AgeCat7      0.78894    0.77462   1.018  0.30845    
VAgeCat1     0.27525    0.16972   1.622  0.10486    
VAgeCat2     0.46387    0.15848   2.927  0.00342 ** 
VAgeCat6    -0.25227    0.71209  -0.354  0.72314    
NCD10       -0.21008    0.17788  -1.181  0.23761    
NCD20       -0.45123    0.21020  -2.147  0.03182 *  
NCD30       -0.42920    0.21532  -1.993  0.04623 *  
NCD40       -0.78419    0.26838  -2.922  0.00348 ** 
NCD50       -0.64799    0.15494  -4.182 2.89e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1590.5  on 3841  degrees of freedom
Residual deviance: 1551.8  on 3827  degrees of freedom
AIC: 2166

Number of Fisher Scoring iterations: 6

Poisson Regression: Example

Example: Modelling claims frequency

Comments:

  • The estimated female effect \(\hat \beta_2 \approx -0.15\) means that the expected number of claims is lower for females than for males, adjusting for the remaining variables; that is, for fixed levels of the other variables, the estimated expected number of claims for a female equals the estimated expected number of claims for a male times \(\exp(-0.15)=0.86\)

\[\hat E[Y_i|female_i=1, \ldots] = \exp(-0.15)\, \hat E[Y_i|female_i=0, \ldots]\]

exp(coef(model1)[2]) 
   Female 
0.8566153 

Example: Modelling claims frequency

Equivalently, the log expected number of claims for females is 0.15 lower than the log expected number of claims for males

  • The Wald 95% confidence interval is \(\exp(-0.155 \pm 1.96 \times 0.155)\), that is \((0.63,1.16)\)
exp(coef(model1)[2] + c(-1,1) * summary(model1)$coefficients[2, 2] * qnorm(0.975) )
[1] 0.631850 1.161335

Poisson Regression: Example

Example: Modelling claims frequency

  • Let us focus on the age (of the policy holder) variables. It seems there are no significant differences among the ages. However, since age is treated as categorical, the interpretation must be done w.r.t. the reference category, here \(AgeCat = 2\) (26–35 years). The estimates for the age categories suggest that the expected number of claims is higher for older people, adjusting for the remaining covariates. For instance, for given values of the remaining covariates, the expected number of claims for \(AgeCat = 3\) is \(\exp(0.26)= 1.30\) times the expected number of claims for \(AgeCat = 2\), that is

\[\hat E[Y_i|AgeCat_i=3, \ldots] = \exp(0.26)\, \hat E[Y_i|AgeCat_i=2, \ldots ]\]

exp(coef(model1)[3]) 
AgeCat3 
1.29789 

Poisson Regression: Example

Example: Modelling claims frequency

We can analyse the LRTs, noticing that the only very small p-value is associated with NCD (and there is marginal evidence that the subgroup of coefficients associated with VAgeCat is statistically different from zero)

anova(model1, test = "LRT")
Analysis of Deviance Table

Model: poisson, link: log

Response: Clm_Count

Terms added sequentially (first to last)

        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                     3841     1590.5              
Female   1   0.6579      3840     1589.8 0.4173034    
AgeCat   5   6.2897      3835     1583.6 0.2790450    
VAgeCat  3   9.0632      3832     1574.5 0.0284618 *  
NCD      5  22.7046      3827     1551.8 0.0003844 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Poisson Regression: Example

Example: Modelling claims frequency

Let’s take a look at a few nested models via AIC

       df      AIC
model0  1 2176.704
model1 15 2165.989
model2  9 2159.937
model3  6 2163.042
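The code producing this table is not shown; a plausible sketch, where the specifications of model0, model2 and model3 are assumptions inferred from the reported degrees of freedom (1, 9 and 6), is:

```r
## Assumed specifications (inferred from the df column): null model,
## model without Female and AgeCat, and model with NCD only.
## These fits use the accidents data and model1 from the code above.
## Note that the script later reuses the name model2 for the offset model.
model0 <- glm(Clm_Count ~ 1, family = poisson, data = accidents)
model2 <- glm(Clm_Count ~ VAgeCat + NCD, family = poisson, data = accidents)
model3 <- glm(Clm_Count ~ NCD, family = poisson, data = accidents)
AIC(model0, model1, model2, model3)
```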

Example: Modelling claims frequency

We can clearly see that the models without gender and individual age have a lower AIC than the full model.

Although this is not the case here, there are several applications (see the extra material example) where you have to deal with overdispersion. In such cases, some possible solutions are:

  • Considering the negative binomial distribution
  • Considering the quasi-Poisson model
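Both remedies could be fitted along these lines (a sketch reusing the covariates of model1; glm.nb is in the MASS package):

```r
## Quasi-Poisson: same mean structure, with an estimated dispersion parameter
modelQP <- glm(Clm_Count ~ Female + AgeCat + VAgeCat + NCD,
               family = quasipoisson, data = accidents)

## Negative binomial via MASS::glm.nb
library(MASS)
modelNB <- glm.nb(Clm_Count ~ Female + AgeCat + VAgeCat + NCD,
                  data = accidents)
```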

Poisson Regression: Example

Example: Modelling claims frequency. Taking into account exposure

The cars considered here have been insured only for part of the year, so we should take this into account: the exposure differs across cars, and less exposed cars have fewer accidents. We should rather model the rate, that is \[ \log(\mu_i/t_i) = \sum_{j = 1}^{p}x_{ij}\beta_j \quad \text{equivalent to} \quad \log(\mu_i) = \sum_{j = 1}^{p}x_{ij}\beta_j + \log(t_i)\] where \(t_i\) is the exposure and the term \(\log(t_i)\) is called the offset. In this specific case we want to model the rate \(Clm\_Count/Exp\_weights\). Setting a log-linear model for this rate is equivalent to introducing the logarithm of Exp_weights into a Poisson regression model as an offset. This can be done directly as follows (note that the parameters do not change much, but take a look at the AIC):

model2 <- glm(Clm_Count ~ Female + AgeCat + VAgeCat + NCD, offset = log(Exp_weights),
              family = poisson, data = accidents)
summary(model2)

Call:
glm(formula = Clm_Count ~ Female + AgeCat + VAgeCat + NCD, family = poisson, 
    data = accidents, offset = log(Exp_weights))

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.65003    0.31027  -5.318 1.05e-07 ***
Female      -0.17460    0.15524  -1.125  0.26072    
AgeCat3      0.16595    0.31540   0.526  0.59879    
AgeCat4      0.11903    0.32142   0.370  0.71115    
AgeCat5     -0.05699    0.35314  -0.161  0.87179    
AgeCat6      0.49974    0.39179   1.276  0.20212    
AgeCat7      0.69047    0.77848   0.887  0.37511    
VAgeCat1     0.19944    0.16977   1.175  0.24008    
VAgeCat2     0.40639    0.15863   2.562  0.01041 *  
VAgeCat6    -0.55839    0.71390  -0.782  0.43412    
NCD10       -0.24121    0.17828  -1.353  0.17605    
NCD20       -0.46202    0.20984  -2.202  0.02768 *  
NCD30       -0.46611    0.21597  -2.158  0.03091 *  
NCD40       -0.82628    0.26847  -3.078  0.00209 ** 
NCD50       -0.72454    0.15521  -4.668 3.04e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1514.5  on 3841  degrees of freedom
Residual deviance: 1472.6  on 3827  degrees of freedom
AIC: 2086.8

Number of Fisher Scoring iterations: 6
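Equivalently, the offset can be written inside the model formula with offset(); this sketch fits the same model as model2:

```r
## Same fit as model2, with the offset specified in the formula
model2b <- glm(Clm_Count ~ Female + AgeCat + VAgeCat + NCD +
                 offset(log(Exp_weights)),
               family = poisson, data = accidents)
```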